Abstract

Topological properties of networks have recently been widely applied to the link-prediction problem. Common Neighbors,
for example, is a natural yet efficient framework, and many variants of it have been proposed to further
boost the discriminative resolution of candidate links. In this paper, we reexamine the role of network topology in
predicting missing links from the perspective of information theory, and present a practical approach based on the mutual
information of network structures. It not only improves the prediction accuracy substantially, but also has
reasonable computing complexity.
Introduction

Link prediction attempts to estimate the likelihood of the 
existence of links between nodes based on the available network 
information, such as the observed links and nodes' attributes [1,2]. 
On the one hand, the link-prediction problem is a long-standing 
practical scientific issue. It can find broad applications in both 
identifying missing and spurious links and predicting the candidate 
links that are expected to appear with the evolution of networks 
[1,3,4]. In biological networks (such as protein-protein interaction 
networks [5] and metabolic networks [6]), for example, the 
discovery of interactions is usually costly. Therefore, accurate
prediction is more reasonable than blindly checking all
latent interaction links [3,4]. In addition, the detection of inactive
or anomalous connections in online social networks may improve
the performance of link-based ranking algorithms [7]. Furthermore,
in online social networks, very promising candidate links
(non-connected node pairs) can be recommended to the relevant
users as potential friendships [8,9]. This can help them find new
friends and thus enhance their loyalty to the web sites. In ref. [9],
the authors even proposed the potential theory to facilitate
missing-link prediction in directed networks. The hypothesis can
find broad applications in friendship recommendation for large-scale
directed social networks, such as Twitter and Weibo.
On the other hand, theoretically, link prediction can provide a 
useful methodology for the modeling of networks [10]. The 
evolving mechanisms of networks have been widely studied. Many 
evolving models have been proposed to capture the evolving 
process of real-world networks [11-14]. However, it is very hard to 
quantify the degree to which the proposed evolving models govern 
real networks. In fact, each evolving model can be viewed as a
corresponding predictor; we can thus apply evaluation metrics on
prediction accuracy to measure the performance of different
models.

Therefore, link prediction has attracted much attention from
various scientific communities. Within the computer science community, for
example, researchers have employed Markov chains [15,16] and
machine learning techniques [17-21] to extract features of
networks. These methods, however, depend on the attributes of
nodes for particular networks, such as social and textual features.
The attributes of nodes are generally hidden, and it is
thus difficult to obtain them [2].

Over the last 15 years, network science has been developed as a 
novel framework for understanding structures of many real-world 
networked systems. Recently, a wealth of algorithms based on 
structural information have been proposed [2,4,22-28]. Among 
various node-neighbor-based indices, Common Neighbors (CN) is
undoubtedly the precursor, with low computing complexity. It has
also been shown that CN achieves high prediction accuracy
compared with other classical prediction indices [25]. CN,
however, only emphasizes the number of common neighbors and
ignores the difference in their contributions. Several
variants of CN have therefore been put forward to correct this defect.
Consider, for example, Adamic-Adar [24] and Resource Allocation
[25], in which low-degree common neighbors are favored
by assigning more weight to them. In addition, based on
Bayesian theory, a Local Naive Bayes model [27] was presented to
differentiate the roles of neighboring nodes. Furthermore, node
centrality (including degree, closeness and betweenness) was
applied to make neighbors more distinguishable. Besides such
CN-based indices, the evolving patterns and organizing principles
of networks can also provide useful insights for coping with the
link-prediction problem. The well-known mechanism of preferential
attachment [11], for instance, has been viewed as a prediction
measure [25,29]. For networks exhibiting hierarchical structure,
Hierarchical Random Graph can be employed to predict missing
links accordingly [4]. Recently, communities have been reinvented
as groups of links rather than nodes [30]. Motivated by this shift in
perspective, Cannistraci et al. developed the local-community
paradigm to enhance the performance of classical
prediction techniques [28].

All the aforementioned methods aim to quantify the existence
likelihood of candidate links. In information theory, the likelihood
can be measured by the self-information. In this article, we thus
give a more theoretical analysis of the link-prediction problem
from the perspective of information theory, and then present a general
prediction approach based on mutual information.
Overall, our framework substantially outperforms the six prediction
indices it is compared with.

Results 

A Mutual Information Approach to Link Prediction 

We here introduce the definitions of the self-information and of 
the mutual information, respectively. 

Definition 1 Consider a random variable $X$ with
outcome $x_k$ occurring with probability $p(x_k)$. Its self-information $I(x_k)$ can
be denoted as [31]

$$I(x_k) = \log \frac{1}{p(x_k)} = -\log p(x_k), \tag{1}$$

where the base of the logarithm is 2, so the unit of
self-information is the bit. This convention applies in the following unless
otherwise specified. The self-information indicates the uncertainty
of the outcome $x_k$: the higher the self-information is,
the less likely the outcome $x_k$ is to occur.
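
As a minimal sketch of Definition 1, the self-information in bits can be computed directly from a probability. The function name `self_information` is chosen here for illustration only and does not come from the original method description.

```python
import math

def self_information(p: float) -> float:
    """Self-information, in bits, of an outcome with probability p (eq. 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# A likely outcome carries little information, a rare one carries more.
print(self_information(0.5))    # fair coin flip
print(self_information(0.125))  # one outcome out of eight equally likely ones
```

The rarer outcome yields the larger value, which is the sense in which self-information measures how unlikely an outcome is.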

Definition 2 Consider two random variables $X$ and $Y$ with a
joint probability mass function $p(x,y)$ and marginal probability
mass functions $p(x)$ and $p(y)$. The mutual information $I(X;Y)$
can be denoted as follows [32]:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)}. \tag{2}$$
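
The mutual information of Definition 2 can be illustrated with a short sketch that sums over a joint probability mass function. The nested-list representation `joint[x][y]` and the two example distributions are assumptions made purely for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with p(x,y) = 0 contribute nothing
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y unrelated
coupled = [[0.5, 0.0], [0.0, 0.5]]          # Y determines X
print(mutual_information(independent))
print(mutual_information(coupled))
```

The independent pair yields zero bits, while the fully coupled pair yields one bit, the entire uncertainty of a fair binary variable.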



Hence, the mutual information $I(x_k; y_j) = I(X = x_k; Y = y_j)$
can be obtained as

$$I(x_k; y_j) = \log \frac{p(x_k \mid y_j)}{p(x_k)} = -\log p(x_k) - (-\log p(x_k \mid y_j)) = I(x_k) - I(x_k \mid y_j). \tag{3}$$

The mutual information is the reduction in uncertainty due to
another variable. Thus, it is a measure of the dependence between
two variables. It is equal to zero if and only if two variables are
independent.

Now consider the link-prediction problem. Our idea is to use
the local structural information to facilitate the prediction. To do
that, we denote the set of node $x$'s neighboring nodes by $\Gamma(x)$. For
node pair $(x,y)$, the set of their common neighbors is denoted as
$O_{xy} = \Gamma(x) \cap \Gamma(y)$.

Given a disconnected node pair $(x,y)$, if the set of their common
neighbors $O_{xy}$ is available, the likelihood score of node pair $(x,y)$ is
defined as

$$s^{MI}_{xy} = -I(L^1_{xy} \mid O_{xy}), \tag{4}$$

where $I(L^1_{xy} \mid O_{xy})$ is the conditional self-information of the
existence of a link between node pair $(x,y)$ when their common
neighbors are known. According to the property of the self-information,
the smaller $I(L^1_{xy} \mid O_{xy})$ is, the higher the likelihood of
existence of the link is. Thus, we define the score as the negation of
$I(L^1_{xy} \mid O_{xy})$. According to the definition of mutual information,
$I(L^1_{xy} \mid O_{xy})$ can thus be derived as

$$I(L^1_{xy} \mid O_{xy}) = I(L^1_{xy}) - I(L^1_{xy}; O_{xy}), \tag{5}$$

where $I(L^1_{xy})$ is the self-information of the event that node pair $(x,y)$ is
connected, and $I(L^1_{xy}; O_{xy})$ is the mutual information between the
event that node pair $(x,y)$ has a link between them and the
event that the node pair's common neighbors are known. Note
that $I(L^1_{xy})$ is calculated from the prior probability that node $x$ and
node $y$ are connected. In our method, without knowing the
common neighbors of node pair $(x,y)$, we could use $I(L^1_{xy})$ to
estimate the existence of a link between node pair $(x,y)$.
$I(L^1_{xy}; O_{xy})$ indicates the reduction in uncertainty of the connection
between nodes $x$ and $y$ due to the information given by their
common neighbors. Since the mutual information plays a
significant role in our method, this framework is called MI for
short.

If the elements of $O_{xy}$ are assumed to be independent of each
other, then

$$I(L^1_{xy}; O_{xy}) = \sum_{z \in O_{xy}} I(L^1_{xy}; z). \tag{6}$$

Here $I(L^1_{xy}; z)$ can be estimated by $I(L^1; z)$, which is defined as
the average mutual information over all node pairs connected to
node $z$:

$$I(L^1; z) = \frac{1}{\frac{1}{2}|\Gamma(z)|(|\Gamma(z)|-1)} \sum_{\substack{m,n \in \Gamma(z) \\ m \neq n}} I(L^1_{mn}; z). \tag{7}$$

Now we try to calculate the above mutual information.
According to its definition (3), $I(L^1_{mn}; z)$ can be denoted as

$$I(L^1_{mn}; z) = I(L^1_{mn}) - I(L^1_{mn} \mid z), \tag{8}$$

where $I(L^1_{mn} \mid z)$ is the conditional self-information of the event that node pair
$(m,n)$ is connected when node $z$ is one of their common neighbors,
and $I(L^1_{mn})$ denotes the self-information of the event that node pair $(m,n)$
has a link. The right-hand side of eq. (8) is composed of
(conditional) self-information terms, which, by definition, can be calculated from the
corresponding (conditional) probabilities.

The conditional probability $p(L^1_{mn} \mid z)$ can be estimated by the
clustering coefficient of node $z$, defined as

$$p(L^1_{mn} \mid z) = \frac{N_{\triangle z}}{N_{\triangle z} + N_{\wedge z}}, \tag{9}$$

where $N_{\triangle z}$ and $N_{\wedge z}$ are the numbers of connected and of
disconnected node pairs with node $z$ being a common neighbor,
respectively. Once $p(L^1_{mn} \mid z)$ is available, $I(L^1_{mn} \mid z)$ can be
calculated.
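
The clustering-coefficient estimate of eq. (9) can be sketched as follows. The adjacency representation (a dict mapping each node to the set of its neighbors), the function name and the convention of returning zero for a node with fewer than two neighbors are assumptions made for this illustration.

```python
from itertools import combinations

def link_prob_given_common_neighbor(adj, z):
    """Estimate p(L^1_mn | z) by the clustering coefficient of node z."""
    connected = disconnected = 0
    for m, n in combinations(sorted(adj[z]), 2):
        if n in adj[m]:
            connected += 1      # pair counted in N_triangle
        else:
            disconnected += 1   # pair counted in N_wedge
    pairs = connected + disconnected
    # Assumed convention: no neighbor pairs means no evidence, return 0.
    return connected / pairs if pairs else 0.0

# Toy graph: node 1 has neighbors 2, 3, 4 and only the pair (2, 4) is linked.
adj = {1: {2, 3, 4}, 2: {1, 4}, 3: {1}, 4: {1, 2}}
print(link_prob_given_common_neighbor(adj, 1))
```

In this toy graph one of the three neighbor pairs of node 1 is connected, so the estimate is one third.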

In order to calculate the prior probability $p(L^1_{mn})$, we assume that there is no
degree-degree correlation. When nodes' degrees are
known, the probability that node pair $(m,n)$ is disconnected is
derived as

$$p(L^0_{mn}) = \frac{\binom{M-k_m}{k_n}}{\binom{M}{k_n}}, \tag{10}$$

where $k_m$ and $k_n$ are the degrees of nodes $m$ and $n$, respectively, and
$M$ is the total number of links in the training set. Obviously this
formula is symmetric, namely

$$\frac{\binom{M-k_m}{k_n}}{\binom{M}{k_n}} = \frac{\binom{M-k_n}{k_m}}{\binom{M}{k_m}}. \tag{11}$$

Thus,

$$p(L^1_{mn}) = 1 - p(L^0_{mn}) = 1 - \frac{\binom{M-k_m}{k_n}}{\binom{M}{k_n}}, \tag{12}$$

and $I(L^1_{mn})$ can be calculated accordingly.
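
The degree-based prior can be sketched numerically. The sketch assumes the prior takes the binomial form $p(L^1_{mn}) = 1 - \binom{M-k_m}{k_n} / \binom{M}{k_n}$ for uncorrelated degrees; the function names, the degree values and the choice $M = 10$ are illustrative assumptions.

```python
import math

def prior_link_probability(k_m, k_n, num_links):
    """p(L^1_mn) = 1 - C(M - k_m, k_n) / C(M, k_n), assuming no degree correlation."""
    p_disconnected = math.comb(num_links - k_m, k_n) / math.comb(num_links, k_n)
    return 1.0 - p_disconnected

def prior_self_information(k_m, k_n, num_links):
    """I(L^1_mn) in bits, the self-information of the prior link probability."""
    return -math.log2(prior_link_probability(k_m, k_n, num_links))

# Illustrative degrees in a network with M = 10 links.
for k_m, k_n in [(3, 1), (3, 2), (1, 2)]:
    print(k_m, k_n, round(prior_self_information(k_m, k_n, 10), 4))
```

Swapping the two degrees leaves the probability unchanged, which is the symmetry noted above, and pairs of higher-degree nodes receive a smaller self-information, i.e. a higher prior likelihood of being connected.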
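Collecting these results, we can obtain

$$\begin{aligned}
I(L^1_{xy}; z) \approx I(L^1; z) &= \frac{1}{\frac{1}{2}|\Gamma(z)|(|\Gamma(z)|-1)} \sum_{\substack{m,n \in \Gamma(z) \\ m \neq n}} \left( I(L^1_{mn}) - I(L^1_{mn} \mid z) \right) \\
&= \frac{2}{|\Gamma(z)|(|\Gamma(z)|-1)} \sum_{\substack{m,n \in \Gamma(z) \\ m \neq n}} \left( -\log p(L^1_{mn}) - \left( -\log p(L^1_{mn} \mid z) \right) \right) \\
&= \frac{2}{|\Gamma(z)|(|\Gamma(z)|-1)} \sum_{\substack{m,n \in \Gamma(z) \\ m \neq n}} \left( \log \frac{N_{\triangle z}}{N_{\triangle z} + N_{\wedge z}} - \log \left( 1 - \frac{\binom{M-k_m}{k_n}}{\binom{M}{k_n}} \right) \right).
\end{aligned} \tag{13}$$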




Figure 1. An illustration of the calculation of the MI model.

doi:10.1371/journal.pone.0107056.g001 
It is stipulated that $I(L^1; z) = 0$ if $N_{\triangle z} = 0$.
Based on the above derivation, we have

$$s^{MI}_{xy} = -I(L^1_{xy} \mid O_{xy}) = \sum_{z \in O_{xy}} I(L^1_{xy}; z) - I(L^1_{xy}), \tag{14}$$

where $I(L^1_{xy}; z)$ and $I(L^1_{xy})$ can be calculated by eqs. (13) and (12),
respectively.

To facilitate the understanding of MI, we illustrate it with an
example as shown in fig. 1. First, consider node $v_1$, for example,
which is the common neighbor of nodes $v_2$, $v_3$ and $v_4$. Using eq.
(9), we have
$I(L^1_{v_2v_3} \mid v_1) = I(L^1_{v_3v_4} \mid v_1) = I(L^1_{v_2v_4} \mid v_1) = \log 3 = 1.585$. Based on
eq. (12), we obtain $I(L^1_{v_2v_3}) = \log \frac{10}{3} = 1.737$,
$I(L^1_{v_2v_4}) = \log \frac{15}{8} = 0.9069$ and $I(L^1_{v_3v_4}) = \log 5 = 2.3219$. Hence,
we have $I(L^1; v_1) = 0.0703$. Now we compare node pairs $(v_2,v_3)$
and $(v_3,v_4)$ with the common neighbor node $v_1$. Then
$I(L^1_{v_2v_3} \mid O_{v_2v_3}) = 1.6667$ and $I(L^1_{v_3v_4} \mid O_{v_3v_4}) = 2.2516$, which can be
calculated based on eq. (5). That is to say, node pair $(v_2,v_3)$ is
more likely to be connected than node pair $(v_3,v_4)$. The six
prediction methods mentioned in section "Previous Prediction
Methods", however, cannot distinguish these two node pairs. In
this sense, MI has higher discriminative resolution than them.
Second, MI can distinguish node pairs even if they have no
common neighbors. For instance, $I(L^1_{v_3v_5}) = \log \frac{10}{3} = 1.7370$ and
$I(L^1_{v_3v_8}) = \log 5 = 2.3219$. That is to say, node pair $(v_3,v_5)$ is more
likely to be connected than node pair $(v_3,v_8)$. This is
beyond the distinguishing ability of previous methods. Third, the
mutual information of node $v_6$ can be calculated as
$I(L^1; v_6) = I(L^1; v_7) = 0.1854$. Thus $I(L^1_{v_5v_8} \mid O_{v_5v_8}) = 0.5361$. We
note that $I(L^1_{v_5v_8} \mid O_{v_5v_8}) < I(L^1_{v_2v_3} \mid O_{v_2v_3})$, namely, node pair $(v_5,v_8)$
with two common neighbors has higher connection likelihood
than node pair $(v_2,v_3)$ with only one common neighbor.
This agrees well with our intuition. Lastly, different
nodes may provide different mutual information to reduce the
uncertainty of connections. The extent to which node $v_6$
($I(L^1; v_6) = 0.1854$) contributes to the reduction of link uncertainty,
for example, is greater than that of node $v_1$
($I(L^1; v_1) = 0.0703$).
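
The calculation in this example can be sketched in code. The edge list below is an assumption: the figure is not reproduced here, so an eight-node, ten-link graph was chosen whose degrees and triangles agree with the values quoted above. The function names, the binomial form of the degree-based prior and the handling of impossible links (infinite self-information) are likewise illustrative choices, not a reference implementation.

```python
import math
from itertools import combinations

# Assumed edge list, chosen to be consistent with the quoted example values.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 4), (2, 5),
         (5, 6), (5, 7), (6, 7), (6, 8), (7, 8)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def prior_info(adj, num_links, m, n):
    """I(L^1_mn) from node degrees; infinite if the prior probability is zero."""
    k_m, k_n = len(adj[m]), len(adj[n])
    p1 = 1.0 - math.comb(num_links - k_m, k_n) / math.comb(num_links, k_n)
    return -math.log2(p1) if p1 > 0.0 else math.inf

def node_mi(adj, num_links, z):
    """I(L^1; z): mean of I(L^1_mn) - I(L^1_mn | z) over neighbor pairs of z."""
    pairs = list(combinations(sorted(adj[z]), 2))
    triangles = sum(1 for m, n in pairs if n in adj[m])
    if triangles == 0:  # stipulated to be zero when no neighbor pair is linked
        return 0.0
    cond_info = -math.log2(triangles / len(pairs))
    total = sum(prior_info(adj, num_links, m, n) - cond_info for m, n in pairs)
    return total / len(pairs)

def mi_score(adj, num_links, x, y):
    """s^MI_xy: summed I(L^1; z) over common neighbors minus I(L^1_xy)."""
    common = adj[x] & adj[y]
    gain = sum(node_mi(adj, num_links, z) for z in common)
    return gain - prior_info(adj, num_links, x, y)

adj = build_adjacency(EDGES)
M = len(EDGES)
for x, y in [(2, 3), (3, 4), (5, 8)]:
    # The negated score is the conditional self-information of eq. (5).
    print((x, y), round(-mi_score(adj, M, x, y), 3))
```

Under the assumed edge list, the printed conditional self-information values are approximately 1.667, 2.252 and 0.536 bits, in line with the figures quoted in the example up to rounding.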

Experimental Results 

In this section, we compare our MI approach with six other
representative prediction indices, which are introduced in section
"Previous Prediction Methods". Tables 1 and 2 show the
prediction accuracy measured by AUC and precision, respectively.
Overall, MI outperforms them by a large margin.

Table 1 demonstrates that, in terms of AUC, the MI model gives much
higher prediction accuracy than all six other indices for the real-world
networks, except network Grid. For networks EPA and
INT in particular, the AUC of the six indices is around 0.6, whereas the MI model
achieves an AUC of more than 0.9. Such a large difference may arise
because previous methods cannot distinguish node pairs
without common neighbors. Unfortunately, the lack of common
neighbors between two nodes often appears in real-world networks.
For example, more than 99% of node pairs in network INT have
no common neighbors. The MI approach, however, is able to discriminate
among them. Another finding is that CAR-based indices (CAR
and CRA) achieve the worst prediction performance on the ten
networks. Indeed, for node pairs with few common neighbors,
the distinguishing ability of CAR-based indices degenerates
remarkably due to their emphasis on the links among common
neighbors. For example, all node pairs with fewer than two common
neighbors share the same connection likelihood because they
have no links among common neighbors.

Table 2 shows the comparison of precision for the ten real-world
networks. We can see that MI is much better than CN, RA, LNB-CN
and LNB-RA for all networks. CAR-based indices, however,
achieve higher precision than MI for some networks. The
efficiency of CAR-based indices in predicting top-ranked candidate
links is very high for networks with notable link communities.
Consider, for example, network Wikivote with its high average
degree, in which CAR-based indices overwhelmingly outperform MI and
the other methods. The extent to which CAR-based indices
excel MI is thus positively related to the presence of link communities. The computing
complexity of CAR-based indices, however, depends strongly on the
density of networks.

It is thus necessary to compare the computing complexity of
CAR-based indices and our MI model. Here the average degree is
denoted as $\langle k \rangle$. According to eq. (23), the time complexity of
computing $\gamma(z)$ and $O_{xy}$ is $O(\langle k \rangle^4)$ and $O(\langle k \rangle^2)$, respectively.
The total computing complexity of CAR is thus $O(N^2 \langle k \rangle^6)$.
Similarly to CAR, the computing complexity of CRA is also
$O(N^2 \langle k \rangle^6)$ because $|\Gamma(z)|$ has the computing complexity of $O(1)$
based on eq. (24). For MI, computing $I(L^1_{mn})$
and averaging over all neighboring node pairs of node $z$ both have complexity
$O(\langle k \rangle^2)$. Thus, $I(L^1; z)$ has the computing complexity of
$O(\langle k \rangle^4)$. The computing complexity of the MI model can be derived
as $O(N^2 \langle k \rangle^4)$ accordingly. Taking precision and the computing
complexity of CAR-based indices together, we note that they
outperform MI in some networks, but at a computing
complexity $\langle k \rangle^2$ times that of MI. This is intolerable, especially
for networks with high average degree.

We also conduct experiments on an ASUS RS500-E6-PS4
workstation with 16 GB RAM and an Intel(R) Xeon(R) E5606 @
2.13 GHz quad-core processor. The detailed comparison of
computational time on the ten real-world networks is summarized in
Table 3. The results indicate that the MI index is far faster than the CAR-based
methods while remaining on a time scale similar to that of the other CN-based
methods.

Altogether, MI offers a good tradeoff among AUC, precision and
computing complexity.

Discussion 

In this paper, we develop a novel framework to uncover missing
edges in networks via the mutual information of network topology.
Note that our approach differs crucially from previous prediction
methods in that it is derived from information theory. We
compare our model with six typical prediction indices on ten
networks from disparate fields. The simulation results show that
the MI model outperforms them overall. Furthermore, we compare the
computing complexity of the MI model with that of CAR-based
indices and find that our approach is less time-consuming.

Notice that the calculation of the mutual information depends
on the assumption that the network is free of assortativity.
However, we find that the MI method performs very well not only in
uncorrelated networks but also in networks with a high assortativity
coefficient, such as PB, Yeast and EPA. In fact, the assortativity
coefficient is a global network-level property [33], as
shown in Table 4, and cannot convey sufficient information about
local structure. Considering that our method mainly focuses on
the neighbors of two nodes, we utilize local assortativity [34,35] to
explain this phenomenon. For a network with N nodes and M
links, the distribution of excess degree (a node's degree minus
one) is denoted as $q(k)$. Then, the local assortativity of
node $v$ is defined as [35]



$$\rho_v = \frac{j(j+1)(\bar{k} - \mu_q)}{2M\sigma_q^2}, \tag{15}$$
where $j$ is node $v$'s excess degree, $\bar{k}$ denotes the average excess
degree of node $v$'s neighbors, $\mu_q$ is defined as the expectation of
distribution $q(k)$ and $\sigma_q$ is the standard deviation of distribution
$q(k)$. Based on this definition, the sum of all nodes' local
assortativity is equal to the network assortativity coefficient.
Fig. 2 shows the cumulative distribution function of nodes' local
assortativity. We find that i) both locally assortative and
disassortative nodes exist regardless of the network-level assortative
mixing pattern; ii) most nodes do not show the local assortative
property, which is consistent with our assumption. Since our
method is related to the local assortativity rather than the global
one, it can achieve good prediction performance even in
globally correlated networks.

Materials and Methods 

Data and Problem Description 

In this article, in order to better assess the statistical
performance of our method, we choose ten example data sets from
various areas, each with a giant component of size greater
than 1000. They are listed as follows. i) Email [37]: A network of
Alex Arenas's email. ii) PB [38]: A network of US political
blogs. iii) Yeast [39]: A protein-protein interaction network. iv)
SciMet [40]: A network of articles from or citing Scientometrics. v)
Kohonen [40]: A network of articles with topic self-organizing
maps or references to Kohonen T. vi) EPA [41]: A network of web
pages linking to the website www.epa.gov. vii) Grid [12]: An
electrical power grid of the western US. viii) INT [42]: The router-level
topology of the Internet. ix) Wikivote [43,44]: A network that
contains all the Wikipedia voting data from the inception of
Wikipedia until January 2008. x) Lederberg [45]: A network of
articles by and citing J. Lederberg, from 1945 to 2002.
Here we only focus on the giant component of each network. Their
basic topological parameters are summarized in Table 4.

In this paper, only undirected simple networks $G(V,E)$ are
studied, where $V$ and $E$ are the sets of nodes and of links,
respectively. That is to say, the direction of links, self-connections
and multiple links are ignored here. The framework of prediction
indices can be described as follows [2]. Given a disconnected node
pair $(x,y)$, where $x, y \in V$, we try to predict the likelihood of
connectivity between them. For each non-existent link
$(x,y) \in U - E$, where $U$ represents the universal set, a score $s_{xy}$
is given to measure its existence likelihood according to a
specific predictor. The higher the score is, the more likely the
node pair is to have a candidate link. To figure out the latent links, all
disconnected node pairs are first sorted in descending order of score. The
top-ranked node pairs are believed most likely to have links.
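
The ranking step can be sketched as follows. The adjacency representation (a dict of neighbor sets), the function names and the use of the common-neighbor count as the example predictor are assumptions made for illustration; any predictor with the same signature could be substituted.

```python
from itertools import combinations

def rank_candidates(adj, score):
    """Sort all disconnected node pairs in descending order of score."""
    candidates = [(x, y) for x, y in combinations(sorted(adj), 2)
                  if y not in adj[x]]
    return sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)

def common_neighbors(adj, x, y):
    """Example predictor: the number of common neighbors of x and y."""
    return len(adj[x] & adj[y])

# Toy undirected network; node "e" is isolated.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"},
       "c": {"a", "b", "d"}, "d": {"b", "c"}, "e": set()}
ranking = rank_candidates(adj, common_neighbors)
print(ranking[0])  # the pair believed most likely to be linked
```

Each unordered pair is listed once with its endpoints in sorted order, and pairs with equal scores keep their enumeration order because the sort is stable.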

Figure 2. Cumulative distribution function of local assortativity, $F(\rho)$ vs $\rho$, for networks PB (r = -0.2213), Yeast (r = 0.4539) and EPA (r = -0.3041), respectively, where $F(\rho)$
is the fraction of nodes with local assortativity not larger than $\rho$, and r is the assortativity coefficient of the network,
which is presented in Table 4.
doi:10.1371/journal.pone.0107056.g002

To validate the prediction performance of the algorithms, the
observable links of the network are divided into two separate sets,
i.e., the training set $E^T$ and the probe set $E^P$. $E^T$ is the
available topological information, and $E^P$ is for the test and thus
cannot be used for prediction. Therefore, $E^T \cup E^P = E$ and
$E^T \cap E^P = \emptyset$. In our model, the training set $E^T$ and probe set $E^P$
are assumed to contain 90% and 10% of links, respectively (see the
review article [2] and references therein).
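
A minimal sketch of this division is given below. It assumes that the probe links are drawn uniformly at random, which the text does not state explicitly; the function name, the `seed` parameter and the toy edge list are illustrative.

```python
import random

def split_links(edges, probe_fraction=0.1, seed=0):
    """Randomly divide observed links into a training set and a probe set."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_probe = round(probe_fraction * len(shuffled))
    return shuffled[n_probe:], shuffled[:n_probe]

edges = [(i, i + 1) for i in range(20)]     # toy chain with 20 links
train, probe = split_links(edges)
print(len(train), len(probe))
```

The two returned lists are disjoint and together contain every observed link. This sketch does not check whether the training network stays connected after the probe links are removed.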

As in many previous papers, two widely used metrics are
adopted to evaluate the performance of prediction algorithms [2].
They are AUC (area under the receiver operating characteristic
curve) [46] and precision [47]. AUC is denoted as follows:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}, \tag{16}$$

where, among $n$ independent comparisons, $n'$ and $n''$
represent the number of times that a randomly chosen missing link has a
higher score and the number of times that it has the same score, compared
with a randomly chosen nonexistent link, respectively. Clearly,
AUC should be around 0.5 if all scores follow an independent and
identical distribution. Therefore, as a macroscopic accuracy
measure, the extent to which AUC exceeds 0.5 indicates the
performance of a specific method compared with pure chance.
Another popular measure is precision, which focuses on top-ranked
latent links. It is defined as $L_r / L$, where, among the top-$L$
candidate links, $L_r$ is the number of accurately predicted links in the
probe set.
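
Both metrics can be sketched in a few lines. The function names, the default number of comparisons and the representation of pairs as tuples are assumptions made for this illustration.

```python
import random

def auc(probe_scores, nonexistent_scores, n=10000, seed=0):
    """Sampled AUC of eq. (16): (n' + 0.5 n'') / n over n random comparisons."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        s_probe = rng.choice(probe_scores)        # a missing (probe) link
        s_none = rng.choice(nonexistent_scores)   # a nonexistent link
        if s_probe > s_none:
            higher += 1
        elif s_probe == s_none:
            ties += 1
    return (higher + 0.5 * ties) / n

def precision(ranked_pairs, probe_set, top_l):
    """Fraction of the top-L ranked pairs that belong to the probe set."""
    hits = sum(1 for pair in ranked_pairs[:top_l] if pair in probe_set)
    return hits / top_l

print(auc([3, 3], [1, 2]))   # probe links always score higher
print(auc([1], [1]))         # scores carry no information
print(precision([(1, 2), (3, 4), (5, 6)], {(1, 2), (5, 6)}, top_l=2))
```

For `precision`, the pairs in the ranking and in the probe set need a common canonical ordering of endpoints, for example the smaller node label first.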

Previous Prediction Methods 

We here introduce six typical methods based on common
neighbors. They are Common Neighbors (CN), Resource
Allocation (RA) [25], the Local Naive Bayes (LNB) forms of CN
[27] and RA [27], CAR [28] and CRA [28], respectively.

• CN. This method is the natural framework in which the
more common neighbors nodes $x$ and $y$ share, the more
likely they are to be connected. The score can be quantified by
the number of their common neighbors, namely

$$s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)| = |O_{xy}|. \tag{17}$$



• RA. In this method, the weight of the neighboring node is
inversely proportional to its degree. The score is thus
denoted as

$$s^{RA}_{xy} = \sum_{z \in O_{xy}} \frac{1}{|\Gamma(z)|}. \tag{18}$$



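• LNB-CN. Based on the naive Bayes classifier, this method
combines CN and the clustering coefficient. The
score is defined as

$$s^{LNB\text{-}CN}_{xy} = \sum_{z \in O_{xy}} \left( \log \eta + \log R_z \right). \tag{19}$$

In this formula, $\eta$ is denoted as

$$\eta = \frac{|V|(|V|-1)}{2|E^T|} - 1. \tag{20}$$

In addition, $R_z$ is defined as

$$R_z = \frac{N_{\triangle z} + 1}{N_{\wedge z} + 1}, \tag{21}$$

where $N_{\triangle z}$ and $N_{\wedge z}$ are the same as those in eq. (9).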



• LNB-RA. Similarly to LNB-CN, this method takes RA and
the clustering coefficient into account. The score is thus
denoted as

$$s^{LNB\text{-}RA}_{xy} = \sum_{z \in O_{xy}} \frac{1}{|\Gamma(z)|} \left( \log \eta + \log R_z \right). \tag{22}$$



• CAR. This method boosts the discriminative resolution
between latent links characterized by the same number of
common neighbors by further emphasizing the link
community among such common neighbors. Thus, it is
described as

$$s^{CAR}_{xy} = |O_{xy}| \cdot \sum_{z \in O_{xy}} \frac{|\gamma(z)|}{2}, \tag{23}$$

where $\gamma(z)$ refers to the subset of neighbors of node $z$ that
are also common neighbors of nodes $x$ and $y$.



PLOS ONE | www.plosone.org | September 2014 | Volume 9 | Issue 9 | e107056



• CRA. This method is a variation of CAR when RA is
considered. It can thus be denoted as

$$s^{CRA}_{xy} = \sum_{z \in O_{xy}} \frac{|\gamma(z)|}{|\Gamma(z)|}. \tag{24}$$
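
Four of the indices listed above (CN, RA, CAR and CRA) can be sketched compactly; the two LNB forms are omitted for brevity. The sketch assumes the standard forms of these indices: CN counts common neighbors, RA sums the inverse degrees of common neighbors, CAR multiplies the CN count by half the summed local-community degree |γ(z)|, and CRA sums |γ(z)| / |Γ(z)|. The adjacency representation and function names are illustrative.

```python
def cn(adj, x, y):
    """Common Neighbors: number of shared neighbors."""
    return len(adj[x] & adj[y])

def ra(adj, x, y):
    """Resource Allocation: each common neighbor weighted by inverse degree."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

def gamma(adj, z, common):
    """Neighbors of z that are themselves common neighbors of x and y."""
    return adj[z] & common

def car(adj, x, y):
    """CAR: CN count times half the links among the common neighbors."""
    common = adj[x] & adj[y]
    return len(common) * sum(len(gamma(adj, z, common)) / 2 for z in common)

def cra(adj, x, y):
    """CRA: local-community degree of each common neighbor over its degree."""
    common = adj[x] & adj[y]
    return sum(len(gamma(adj, z, common)) / len(adj[z]) for z in common)

# Toy network: nodes 1 and 2 share neighbors 3 and 4, which are linked.
adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
for index in (cn, ra, car, cra):
    print(index.__name__, index(adj, 1, 2))
```

In this toy network CAR and CRA are nonzero only because the two common neighbors are linked to each other; with a single common neighbor both would be zero, which is the degeneracy discussed in the experimental results.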